(54) Encoder for an end-to-end scalable video delivery system



(57) A software-based encoder is provided for an
end-to-end scalable video delivery system that operates
over heterogeneous networks. The encoder utilizes a
scalable video compression algorithm based on a
Laplacian pyramid decomposition to generate an
embedded information stream. The encoder decimates a
highest resolution original image, e.g., 640x480 pixels, to
produce an intermediate 320x240 pixel image that is
decimated to produce an intermediate 160x120 pixel
image that is compressed to form an encodable base layer
160x120 pixel image. This base layer image is
decompressed to form an image that is up-sampled by
interpolation to produce an up-sampled 320x240 pixel
image. This up-sampled image is subtracted from the
intermediate 320x240 pixel image to form an error image
that is compressed and encoded as a first enhancement
320x240 pixel layer. The decompressed base layer
image is also up-sampled to produce an up-sampled
640x480 pixel image that is subtracted from the
original 640x480 pixel image 200 to yield an error image
that is compressed to yield a second enhancement
640x480 pixel layer. Collectively, the base and
enhancement layers comprise the transmitted embedded bit
stream. At the receiving end, the decoder extracts from
the embedded stream different streams at different
spatial and temporal resolutions. Because decoding
requires only additions and look-ups from a small stored
table, decoding occurs in real-time.
Description

FIELD OF THE INVENTION 



The present invention relates generally to video delivery systems, and more specifically to encoders for such 
systems that will permit video to be delivered scalably, so as to maximize use of network resources and to minimize 
user-contention conflicts. 

BACKGROUND OF THE INVENTION 

It is known in the art to use server-client networks to provide video to end users, wherein the server issues a 
separate video stream for each individual client. 

A library of video sources is maintained at the server end. Chosen video selections are signal processed by a
server encoder, stored on digital media, and are then transmitted over a variety of networks, perhaps on a basis that
allows a remote viewer to interact with the video. The video may be stored on media that includes magnetic disk and
CD-ROM, and the stored information can include video, speech, and images. As such, the source video information may
have been stored in one of several spatial resolutions (e.g., 160x120, 320x240, 640x480 pixels), and temporal reso- 
lutions (e.g., 1 to 30 frames per second). The source video may present bandwidths whose dynamic range can vary 
from 10 Kbps to 10 Mbps. 

The signal processed video is transmitted to the clients (or decoders) over one or more delivery networks that may 
be heterogeneous, e.g., have widely differing bandwidths. For example, telephone delivery lines can transmit at only 
a few tens of Kbps, an ISDN network can handle 128 Kbps, ethernet at 10 Mbps, whereas ATM networks handle even 
higher transmission rates. 

Although the source video has varying characteristics, prior art video delivery systems operate with a system 
bandwidth that is static or fixed. Although such system bandwidths are fixed, in practice, the general purpose computing 
environment associated with the systems is dynamic, and variations in the networks can also exist. These variations
can arise from the outright lack of resources (e.g., limited network bandwidth and processor cycles), contention for 
available resources due to congestion, or a user's unwillingness to allocate needed resources to the task. 

Prior art systems tend to be very computationally intensive, especially with respect to decoding images of differing 
resolutions. For example, where a prior art encoder transmits a bit stream of, say, 320x240 pixel resolution, but the 
decoder requires 160x120 pixel resolution, several processes must be invoked, involving decompression, entropy
coding, quantization, discrete cosine transformation and down-sampling. Collectively, these steps take too long to
be accomplished in real-time. 

Color conversions, e.g., YUV-to-RGB, are especially computationally intensive in the prior art. In another situation,
an encoder may transmit 24 bits, representing 16 million colors, but a recipient decoder may be coupled to a PC having
an 8 bit display, capable of only 256 colors. The decoder must then dither the incoming data, which is a computationally 
intensive task. 

Unfortunately fixed bandwidth prior art systems cannot make full use of such dynamic environments and system 
variations. The result is slower throughput and more severe contention for a given level of expenditure for system 
hardware and software. When congestion (e.g., a region of constrained bandwidth) is present on the network, packets 
of transmitted information will be randomly dropped, with the result that no useful information may be received by the 
client.

Video information is extremely storage intensive, and compression is necessary during storage and transmission.
Although scalable compression would be beneficial, especially for browsing in multimedia video sources, existing
compression systems do not provide desired properties for scalable compression. By scalable compression it is meant that
a full dynamic range of spatial and temporal resolutions should be provided on a single embedded video stream that
is output by the server over the network(s). Acceptable software-based scalable techniques are not found in the prior
art. For example, the MPEG-2 compression standard offers limited extent scalability, but lacks sufficient dynamic range
of bandwidth, is costly to implement in software, and uses variable length codes that require additional error correction
support.

Further, prior art compression standards typically require dedicated hardware at the encoding end, e.g., an MPEG
board for the MPEG compression standard. While some prior art encoding techniques are software-based and operate
without dedicated hardware (other than a fast central processing unit), known software-based approaches are too
computationally intensive to operate in real-time. For example, JPEG software running on a SparcStation 10 workstation
can handle only 2-3 frames/second, e.g., about 1% of the frame/second capability of the present invention.

Considerable video server research in the prior art has focussed on scheduling policies for on-demand situations,
admission control and RAID issues. Prior art encoder operation typically is dependent upon the characteristics of the
client decoders. Simply stated, relatively little work has been directed to video server systems operable over
heterogeneous networks having differing bandwidth capabilities, where host decoders have various spatial and temporal
resolutions.

In summary, there is a need for a video delivery system that provides end-to-end video encoding such that the
server outputs a single embedded data stream from which decoders may extract video having different spatial reso- 
lutions, temporal resolutions and data rates. The encoder should be software-based and provide video compression 
that is bandwidth scalable, and thus deliverable over heterogeneous networks whose transmission rates vary from 
perhaps 10 Kbps to 10 Mbps. Such a system should accommodate lower bandwidth links or congestion, and should 
permit the encoder to operate independently of decoder capability or requirements. 

The decoder for such system should be software-based (e.g., not require specialized dedicated hardware beyond 
a computing system) or should be implemented using inexpensive read-only memory type hardware, and should permit
real-time decompression. The system should permit user selection of a delivery bandwidth to choose the most
appropriate point in spatial resolution, temporal resolution, data-rate and in quality space. The system should also provide
subjective video quality enhancement, and should include error resilience to allow for communication errors.

The present invention provides a software-based encoder for such a system. 

SUMMARY OF THE INVENTION 

The present invention provides a software-based server-encoder for an end-to-end scalable video delivery system 
wherein the server-encoder operates independently of the capabilities and requirements of the software-based
decoder(s). The encoder uses a scalable compression algorithm based upon Laplacian pyramid decomposition. An original
640x480 pixel image is decimated to produce a 320x240 pixel image that is itself decimated to yield a 160x120 pixel 
base image that is encoder-transmitted. 

This base image is then compressed to form a 160x120 pixel base layer, that is decompressed and up-sampled 
to produce an up-sampled 320x240 pixel image. The up-sampled 320x240 pixel image is then subtracted from the 
320x240 pixel image to provide an error image that is compressed and transmitted as a first enhancement layer. The
160x120 pixel decompressed image is also up-sampled to produce an up-sampled 640x480 pixel image that is sub- 
tracted from the original 640x480 pixel image to yield an error image that is compressed and transmitted as a second 
enhancement layer. 

Collectively, the base layer and the first and second enhancement layers comprise the single embedded bitstream
that may be multicast over heterogeneous networks that can range from telephone lines to wireless transmission.
Packets within the embedded bit-stream preferably are prioritized with bits arranged in order of visual importance. The
resultant bit stream is easily rescaled by dropping less important bits, thus providing bandwidth scalability with a dynamic
range from a few Kbps to many Mbps. Further, such embedded bit stream permits the server system to accommodate 
a plurality of users whose decoder systems have differing characteristics. The transmitting end also includes a market- 
based mechanism for resolving conflicts in providing an end-to-end scalable video delivery service to the user. 

At the receiving end, decoders of varying characteristics can extract different streams at different spatial and tem- 
poral resolutions from the single embedded bit stream. Decoding a 160x120 pixel image involves only decompressing 
the base layer 160x120 pixel image. Decoding a 320x240 pixel image involves decompressing and up-sampling the 
base layer to yield a 320x240 pixel image to which is added error data in the first enhancement layer following its 
decompression. To obtain a 640x480 pixel image, the decoder up-samples the up-sampled 320x240 pixel image to 
which is added error data in the second enhancement layer, following its decompression. Thus, decoding is fast and 
requires only table look-ups and additions. Subjective quality of the compressed images is enhanced using perceptual 
distortion measures. The system also provides joint source-channel coding capability on heterogeneous networks.

Other features and advantages of embodiments of the invention will appear from the following description in which 
the preferred embodiments have been set forth in detail, in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGURE 1 is a block diagram of an end-to-end scalable video system, in which the present invention may be
embodied;

FIGURE 2 is a block/flow diagram depicting a software-based encoder that generates a scalable embedded video
stream, embodying the present invention;

FIGURE 3 is a block/flow diagram depicting a decoder recovery of scalable video from a single embedded video
stream.



DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 



Figure 1 depicts an end-to-end scalable video delivery system, including a software-based encoder embodying
the present invention. A source of audio and video information 10 is coupled to a server or encoder 20. The encoder
signal processes the information to produce a single embedded information stream that is transmitted via homogeneous
networks 30, 30' to one or more target clients or software-based decoder systems 40, 40', which decoder uses minimal
central processor unit resources. Network transmission may be through a so-called network cloud 50 from which the
single embedded information stream is multicast to the decoders, or transmission to the decoders 40' may be
point-to-point.

The networks are heterogeneous in that they have widely varying bandwidth characteristics, ranging from as low
as perhaps 10 Kbps for telephones, to 100 Mbps or more for ATM networks. As will be described, the single embedded
information stream is readily scaled, as needed, to accommodate a lower bandwidth network link or to adapt to network
congestion.

Server 20 includes a central processor unit ("CPU") with associated memory, collectively 55, a scalable video
encoder 60 according to the present invention, a mechanism 70 for synchronizing audio, video and textual information,
and a mechanism 80 for arranging the information processed by the scalable video encoder onto video disks 90 (or other
storage media). Storage 100 is also provided for signal processed audio information. Software comprising the scalable
video encoder 60 preferably is digitally stored within server 20, for example, within the memory associated with CPU
unit 55.

An admission control mechanism 110 is coupled to the processed video storage 90, as is a communication error
recovery mechanism 120 for handling bit errors or packet/cell loss. The decoder algorithm provides error concealment.
The server communicates with the heterogeneous network(s) through a network interface 130.

The scalable video encoder 60 differs from the prior art in that it preferably is implemented in software only (e.g., no
dedicated hardware), and generates a single embedded information stream. Encoder 60 employs a new video coding
algorithm based on a Laplacian pyramidal decomposition to generate the embedded information stream. (Laplacian
pyramids are known to those skilled in the art, and for that reason further details are not presented
here.) The generated embedded stream allows server 20 to host decoders 40, 40' having various spatial and temporal
resolutions, without the server having to know the characteristics of the recipient decoder(s).

With reference to Figure 2, an original 640x480 pixel image 200 from source 10 is coupled to the scalable video
encoder 60. At step 210, this image is decimated (e.g., filtered and sub-sampled) to 320x240 pixels (image
220), which in turn is decimated to produce a 160x120 pixel base image 240 for encoding by encoder 60.

For the 160x120 pixel base layer, encoding is done on 2x2 blocks (e.g., two adjacent pixels on one
line, and two adjacent pixels on a next line defining the block) with DCT followed by tree-structured vector quantization
("TSVQ") of the results of that transform. For the 320x240 first enhancement layer, encoding is done on 4x4 blocks,
with DCT followed by TSVQ, and for the 640x480 pixel enhancement layer, encoding is done on 8x8 blocks,
again with DCT followed by TSVQ.

At step 250, the 160x120 pixel base image 240 is compressed to form a 160x120 pixel base layer 260, and then
at step 270 is decompressed. The resulting decompressed image 280 is up-sampled by interpolation step 290 to
produce an up-sampled 320x240 pixel image 300.

At summation step 310, the up-sampled 320x240 pixel image 300 is subtracted from the 320x240 pixel image 220
to give an error image that is compressed and transmitted as the first enhancement 320x240 pixel layer 340.

The 160x120 pixel decompressed image 280 is also up-sampled at step 350 to produce an up-sampled 640x480
pixel image 360. At summation step 370, the up-sampled 640x480 pixel image 360 is subtracted from the original
640x480 pixel image 200 to yield an error image 380. At step 390, the error image 380 is compressed to yield a second
enhancement 640x480 pixel layer 400 that is transmitted. Collectively, layers 260, 340 and 400 comprise the embedded
bit-stream generated by the scalable video encoder 60.
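
The encoding flow of Figure 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `decimate` uses plain 2x2 averaging, `upsample` uses pixel replication in place of interpolation, and `compress`/`decompress` are a hypothetical uniform quantizer standing in for the DCT and TSVQ stages. All function names are invented for the example, and image dimensions are assumed to be multiples of four.

```python
def decimate(img):
    """Halve width and height by averaging each 2x2 block (assumed filter)."""
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
             for x in range(0, len(img[0]), 2)]
            for y in range(0, len(img), 2)]


def upsample(img, factor=2):
    """Enlarge by pixel replication (a stand-in for interpolation)."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def subtract(a, b):
    """Pixel-wise difference a - b of two equally sized images."""
    return [[p - q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def compress(img, step=4):
    """Placeholder lossy codec: uniform quantization to integer indices."""
    return [[round(v / step) for v in row] for row in img]


def decompress(code, step=4):
    """Inverse of the placeholder codec."""
    return [[v * step for v in row] for row in code]


def encode(original):
    """Return (base layer, first enhancement layer, second enhancement layer)."""
    mid = decimate(original)                 # 320x240 image 220 in the text
    base_img = decimate(mid)                 # 160x120 base image 240
    base_layer = compress(base_img)          # base layer 260
    decoded_base = decompress(base_layer)    # decompressed image 280
    enh1 = compress(subtract(mid, upsample(decoded_base, 2)))       # layer 340
    enh2 = compress(subtract(original, upsample(decoded_base, 4)))  # layer 400
    return base_layer, enh1, enh2
```

Because both error images are formed against the decompressed base image rather than the uncompressed one, the encoder and decoder work from the same reference, so base-layer quantization error is captured in the enhancement layers instead of accumulating.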

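Thus, it is appreciated from Figure 2 that a scalable video encoder 60 according to the present invention encodes
three image resolutions. The base layer 260 contains the compressed data for the 160x120 pixel image 240. The first
enhancement layer 340 has error data for the compressed 320x240 pixel image 220, and the second
enhancement layer 400 has error data for the compressed 640x480 pixel image 200.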

The present invention uses vector quantization across transform bands to embed coding to provide bandwidth
scalability with an embedded bit stream. Vector quantization techniques are known in the art. See, for example, A.
Gersho and R. M. Gray, "Vector Quantization and Signal Compression", Kluwer Academic Press, 1992.

Embedded coding and vector quantization may each be performed by tree-structured vector quantization methods
("TSVQ"), e.g., by a successive approximation version of vector quantization ("VQ"). In ordinary VQ, the codewords
lie in an unstructured codebook, and each input vector is mapped to the minimum distortion codeword.
Thus, VQ induces a partition of the input space into Voronoi encoding regions.

By contrast, when using TSVQ, the codewords are arranged in a tree structure, and each input vector is
successively mapped (from the root node) to the minimum distortion child node. As such, TSVQ induces a hierarchical partition
or refinement of the input space as the depth of the tree increases. Because of this successive refinement, an input
vector mapping to a leaf node can be represented with high precision by the path map from the root to the leaf, or with 
lower precision by any prefix of the path. 
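
The successive mapping and the prefix property can be sketched as below. This is an illustrative toy, assuming a binary tree, plain squared error as the distortion, and a hypothetical `Node` class; the codeword values are invented.

```python
class Node:
    """One TSVQ tree node holding a codeword and zero or two children."""

    def __init__(self, codeword, left=None, right=None):
        self.codeword = codeword
        self.children = [c for c in (left, right) if c is not None]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def tsvq_encode(root, vec):
    """Descend from the root, always taking the minimum-distortion child."""
    path, node = [], root
    while node.children:
        best = min(range(len(node.children)),
                   key=lambda i: sq_dist(vec, node.children[i].codeword))
        path.append(best)
        node = node.children[best]
    return path


def tsvq_decode(root, path):
    """Follow a path (or any prefix of it) and return that node's codeword."""
    node = root
    for bit in path:
        node = node.children[bit]
    return node.codeword


# Toy depth-2 tree over 1-dimensional vectors (values are made up).
tree = Node((8,),
            Node((4,), Node((2,)), Node((6,))),
            Node((12,), Node((10,)), Node((14,))))

path = tsvq_encode(tree, (11,))
full = tsvq_decode(tree, path)        # full-precision reproduction
coarse = tsvq_decode(tree, path[:1])  # same vector decoded from a 1-bit prefix
```

Truncating `path` never makes the index invalid; it only moves the reproduction to an ancestor node, which is what makes the encoding embedded.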

Thus, TSVQ produces an embedded encoding of the data. If the depth of the tree is R and the vector dimension
is k, then bit rates 0/k, ..., R/k can all be achieved. To achieve further compression, the index-planes can be
run-length coded followed by entropy coding. Algorithms for designing TSVQs and their variants have been studied
extensively. The Gersho and Gray treatise cited above provides a background survey of such algorithms.

In the prior art, mean squared error typically is used as the distortion measure, with discrete cosine transforms ("DCT")
being followed by scalar quantization. By contrast, the present embodiment performs DCT after which whole blocks 
of data are subjected to vector quantization, preferably with a perception model. 

Subjectively meaningful distortion measures are used in the design and operation of the TSVQ. For this purpose,
vector transformation is made using the DCT. Next, the following input-weighted squared error is applied to the
transform coefficients:

$$d(y, \hat{y}) = \sum_{j=1}^{K} w_j \, (y_j - \hat{y}_j)^2$$

In the above equation, $y_j$ and $\hat{y}_j$ are the components of the transformed vector $y$ and of the corresponding
reproduction vector $\hat{y}$, whereas $w_j$ is a component of the weight vector depending in general on $y$. Stated
differently, distortion is the weighted sum of squared differences between the coefficients of the original transformed
vector and the corresponding reproduced vector.
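
A small numeric sketch of this weighted distortion, and of how the weights change which codeword is selected, is given below. The coefficient values, candidate codewords and weights are invented for illustration; the patent does not specify them.

```python
def weighted_sq_error(y, y_hat, w):
    """Weighted sum of squared coefficient differences."""
    return sum(wj * (yj - hj) ** 2 for yj, hj, wj in zip(y, y_hat, w))


def nearest_codeword(y, codewords, w):
    """Pick the codeword with minimum weighted distortion."""
    return min(codewords, key=lambda c: weighted_sq_error(y, c, w))


y = [10, 2]                      # transformed input vector (hypothetical)
candidates = [[8, 2], [10, 5]]   # two candidate reproduction vectors
flat = [1.0, 1.0]                # unweighted squared error
perceptual = [1.0, 0.1]          # assumes errors in band 2 are less visible

choice_flat = nearest_codeword(y, candidates, flat)
choice_perceptual = nearest_codeword(y, candidates, perceptual)
```

With flat weights the first candidate wins (distortion 4 against 9). With the second band down-weighted, the second candidate wins, since its error falls in the band assumed to be less visible.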

In the described arrangement, the weights reflect human visual sensitivity to quantization errors in different
transform coefficients, or bands. The weights are input-dependent to model masking effects. When used in the perceptual
distortion measure for vector quantization, the weights control an effective stepsize, or bit allocation, for each band.
When the transform coefficients are vector quantized with respect to a weighted squared error distortion measure, the
role played by weights $w_1, \ldots, w_K$ corresponds to stepsizes in the scalar quantization case. Thus, the perceptual model
is incorporated into the VQ distortion measure, rather than into a stepsize or bit allocation algorithm. This permits the
weights to vary with the input vector, while permitting the decoder to operate without requiring the encoder to transmit
any side information about the weights.

In the first stage of the compression encoder shown in Figure 2, an image is transformed using DCT. The second
stage of the encoder forms a vector of the transformed block. Next, the DCT coefficients are vector quantized using a
TSVQ designed with a perceptually meaningful distortion measure. The encoder sends the indices as an embedded
stream with different index planes. The first index plane contains the index for the rate 1/k TSVQ codebook. The second
index plane contains the additional index which along with the first index plane gives the index for the rate 2/k TSVQ
codebook. The remaining index planes similarly have part of the indices for the 3/k, 4/k, ..., R/k TSVQ codebooks,
respectively.

Such encoding of the indices advantageously produces an embedded prioritized bitstream. Thus, rate or bandwidth
scalability is easily achieved by dropping index planes from the embedded bit-stream. At the receiving end, the decoder
can use the remaining embedded stream to index a TSVQ codebook of the corresponding rate.
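
The index-plane arrangement can be illustrated with a short sketch. It assumes one bit per plane and indices held as plain integers; the function names are hypothetical.

```python
def to_index_planes(indices, depth):
    """Split depth-bit indices into planes, most significant bit first.

    planes[p][n] is bit p (counting from the MSB) of indices[n].
    """
    return [[(i >> (depth - 1 - p)) & 1 for i in indices] for p in range(depth)]


def from_index_planes(planes):
    """Rebuild indices from however many leading planes are available."""
    out = [0] * len(planes[0])
    for plane in planes:
        out = [(v << 1) | b for v, b in zip(out, plane)]
    return out


indices = [5, 2, 7]                    # three 3-bit TSVQ indices (made up)
planes = to_index_planes(indices, 3)
full_rate = from_index_planes(planes)        # all planes kept
reduced = from_index_planes(planes[:2])      # last plane dropped
```

Dropping the last plane turns each 3-bit index into its 2-bit prefix, which is a valid index into the rate 2/k codebook, so no re-encoding is needed to lower the rate.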

Frame-rate scalability can be easily achieved by dropping frames, as at present no interframe compression is 
implemented in the preferred embodiment of the encoder algorithm. The algorithm further provides a perceptually
prioritized bit-stream because of the embedding property of TSVQ. If desired, motion estimation and/or conditional
replenishment may also be incorporated into the system. 

Scalable compression is also important for image browsing, multimedia applications, transcoding to different for- 
mats, and embedded television standards. By prioritizing packets comprising the embedded stream, congestion due 
to contention for network bandwidth, central processor unit ("CPU") cycles, etc., in the dynamic environment of general
purpose computing systems can be overcome by intelligently dropping less important packets from the transmitted
embedded stream.

Information layout on the video disk storage system 90 (see Figure 1) preferably involves laying the video as two
streams, e.g., the base layer and the first and second enhancement layer streams. In practice, it is not necessary to



The base layer data is stored as a separate stream from the enhancement layer data on disk subsystem 90. This
allows the system to admit more users when fewer users choose to receive the enhancement layer data. As will now
be described, the base layer data is stored hierarchically, data for each frame being stored together. Each frame has
a set of index planes corresponding to different number of bits used for the lookup.

The compressed stream comprises look-up indices with different number of bits depending on the bandwidth and
quality requirement. The look-up indices for each frame are stored as groups of index planes pre-formatted with
application level headers for network transmission. Preferably the four most significant bits of the lookup indices are
stored together as the first section of the frame block. The remaining less significant bits are then stored, one bit
plane at a time, as separate sections of the frame block to provide lookup indices with 4, 5, 6, 7, 8 bits, respectively.
The different look-up indices provide data streams with different bandwidth requirements.

With reference to Figure 1, server 20 fetches the base signal frame block from the disk 90, transmits the selected

The error data is placed similarly as another data stream. The look-up indices preferably are stored as the most
significant two bits of the look-up indices in the first section for each frame block in the bit stream, followed by the
second two bits of the look-up indices as the second section, followed in turn by four additional 1-bit sections of
look-up indices, to provide look-up indices with 2, 4, 5, 6, 7, 8 bits, respectively.

A RAID design removes any restriction on the number of active users of a given video title, as long as the
users can be accommodated within the server total bandwidth. That is, the usage
can range from all active users receiving the same title at different offsets to all receiving different streams.

The streams of base and enhancement layer data preferably are striped in fixed size units across the set of drives
in the RAID group, with parity placed on an additional drive. The selection of the parity drive is fixed since data updates
are quite rare compared to the number of times the streams are read. The preferred striping policy keeps all of the
look-up indices for an individual frame together on one disk. This allows for ease of positioning when a user single
steps or fast-forwards the display, although there is a penalty in some loss of storage capacity due to fragmentation.
Use of parity on the stripe level allows for quick recovery after a drive failure at the cost of using substantially
more buffer space to hold the full exclusive-OR recovery data set.
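
The exclusive-OR recovery mentioned above can be sketched as follows. This is a generic stripe-parity illustration with made-up two-byte stripe units, not a description of the server's actual disk layout; the function names are hypothetical.

```python
def xor_parity(units):
    """Byte-wise exclusive-OR of equally sized stripe units."""
    out = bytearray(len(units[0]))
    for unit in units:
        for i, b in enumerate(unit):
            out[i] ^= b
    return bytes(out)


def recover_lost_unit(surviving_units, parity_unit):
    """Rebuild the single missing unit from the survivors and the parity."""
    return xor_parity(list(surviving_units) + [parity_unit])


# Three data drives plus one fixed parity drive (values are illustrative).
stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_parity(stripe)

# Simulate failure of the second drive and rebuild its unit.
rebuilt = recover_lost_unit([stripe[0], stripe[2]], parity)
```

Recovery needs every surviving unit of the stripe in memory at once, which is the buffer-space cost the text refers to.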

In this example, the video server utilizes the planar bit stream format directly as the basis for the packet stream in
the network layer. The embedded stream bits plus application packet header are read from disk and are transmitted
on the network in exactly the same format. For example, in the preferred embodiment the base video layer has
the four most significant bits of the look-up indices stored together. Thus, those bits are transmitted as one packet,
and each additional index bit plane of the less significant bits is transmitted as a separate 640 byte packet.
Each packet header preferably contains a frame sequence number, nominal frame rate, size, a virtual time stamp, and a
bit plane type specifier sufficient to make each packet an identifiable stand-alone unit. The server uses the self-identifying

Each decoder 40 includes a mechanism 145 for synchronizing audio and video information from the incoming
embedded stream. If the decoders are software-based, the decoding process algorithms may be stored in memory,
for example memory associated with the decoder CPU, for execution by that CPU. Alternatively, in applications where
full CPU operations are not required, decoders according to the present invention may be implemented in hardware,
e.g., in a simple central processor unit and read-only memory ("ROM") unit 155. Within unit 155 is a relatively
simple central processor unit that, with the associated ROM, represents a hardware unit that may be
produced for a few dollars.



Target decoder system 40 should be able to define at least 160x120, 320x240, 640x480 pixel spatial resolutions,
and at least 1 to 30 frames per second temporal resolution. Decoder system 40 must also accommodate bandwidth
scalability with a dynamic range of video data from 10 Kbps to 10 Mbps. In this arrangement, video encoder
60 provides a single embedded stream from which different streams at different spatial and temporal resolutions and
different data rates can be extracted by decoders 40, depending on decoder capabilities and requirements. However,
as noted, encoder embedding is independent of the characteristics of the decoder(s) that will receive the single
embedded information stream.

For example, decoder 40 can include search engines that permit a user to browse material for relevant segments,
perhaps news, that the user may then select for full review. Within server 20, video storage 90 migrates the full
resolution, full frame rate news stories based on their age and access history from disk to CD ROM to tape, leaving lower
resolution versions behind to support the browsing operation. If a news segment becomes more popular or important, 
the higher resolution can then be retrieved and stored at a more accessible portion of the storage hierarchy 90. 

The decoder(s) may be software-based and merely use the indices from the embedded bit-stream to look-up from 
a codebook that is designed to make efficient use of the cache memory associated with the CPU unit 140. In this 
arrangement, video stream decoding is straightforward, and consists of loading the codebooks into the CPU cache 
memory, and performing look-ups from the stored codebook tables. In practice, the codebook may be stored in less 
than about 12 Kb of cache memory. 

Video decoder 160 may be software-based and uses a Laplacian pyramid decoding algorithm, and preferably can
support up to three spatial resolutions, i.e., 160x120 pixels, 320x240 pixels, and 640x480 pixels. Further, decoder 160
can support any frame rate, as the frames are coded independently by encoder 60.

The decoding methodology is shown in Figure 3. To decode a 160x120 pixel image, decoder 160 at method step
410 need only decompress the base layer 160x120 pixel image 260. The resultant image 430 is copied to video monitor
(or other device) 180. APPENDIX 1, attached hereto, is a sample of decompression as used with the present invention.

To obtain a 320x240 pixel image, decoder 160 first decompresses (step 410) the base layer 260, and then at step
440 up-samples to yield an image 450 having the correct spatial resolution, e.g., 320x240 pixels. Next, at step 460,
the error data in the first enhancement layer 340 is decompressed. The decompressed image 470 is then added at
step 480 to up-sampled base image 450. The resultant 320x240 pixel image 490 is coupled by decoder 160 to a suitable
display mechanism 180.

To obtain a 640x480 pixel image, the up-sampled 320x240 pixel image 450 is up-sampled at step 500 to yield an 
image 510 having the correct spatial resolution, e.g., 640x480 pixels. Next, at step 520, the error data in the second
enhancement layer 400 is decompressed. The decompressed image 530 is added at step 540 to the up-sampled base 
image 510. The resultant 640x480 pixel image 550 is coupled by decoder 160 to a suitable display mechanism 180. 
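
The three decoding paths of Figure 3 can be sketched as below. As with any such illustration, the details are assumptions: `upsample` uses pixel replication in place of interpolation, `decompress` is a hypothetical stand-in for the codebook look-up, and the `level` parameter and the layer values are invented.

```python
def upsample(img):
    """Double width and height by pixel replication (assumed interpolator)."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add(a, b):
    """Pixel-wise sum of two equally sized images."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def decompress(code, step=4):
    """Placeholder for the table look-up that turns indices into pixels."""
    return [[v * step for v in row] for row in code]


def decode(base_layer, enh1, enh2, level):
    """level 0: base resolution, 1: first enhancement, 2: second enhancement."""
    base = decompress(base_layer)            # step 410, image 430
    if level == 0:
        return base
    up_mid = upsample(base)                  # step 440, image 450
    if level == 1:
        return add(up_mid, decompress(enh1))     # steps 460 and 480
    up_full = upsample(up_mid)               # step 500, image 510
    return add(up_full, decompress(enh2))        # steps 520 and 540


# Tiny stand-in layers: 1x1 base, 2x2 and 4x4 error layers.
base_layer = [[2]]
enh1 = [[1, 0], [0, -1]]
enh2 = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

small = decode(base_layer, enh1, enh2, 0)
medium = decode(base_layer, enh1, enh2, 1)
large = decode(base_layer, enh1, enh2, 2)
```

Note that the highest resolution is built from the up-sampled base image plus the second enhancement layer only, so a decoder that wants 640x480 output does not need the first enhancement layer.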

As seen from Figure 3 and the above description, obtaining the base layer from the
embedded bit stream requires only look-ups, whereas obtaining the enhancement layers involves performing look-ups
of the base and error images, followed by an addition process. Preferably the decoder is software-based and operates
rapidly in that all decoder operations are actually performed beforehand, i.e., by preprocessing. The TSVQ decoder
codebook contains the inverse DCT performed on the codewords of the encoder codebook. As noted, in applications
such as video displays where a complex CPU 140 would not necessarily be present, the video decoder may be
implemented in hardware, e.g., by storing the functions needed for decoding in ROM 155, or the equivalent. In practice
ROM 155 may be as small as about 12 Kb.
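
The idea of moving the inverse transform into a precomputed table can be sketched as follows. This is an illustrative example assuming 2x2 blocks and an orthonormal DCT; the codeword values and function names are invented.

```python
def idct_2x2(coeffs):
    """Inverse orthonormal 2x2 DCT of (c00, c01, c10, c11).

    Returns pixels in row-major order (p00, p01, p10, p11).
    """
    c00, c01, c10, c11 = coeffs
    return ((c00 + c01 + c10 + c11) / 2,
            (c00 - c01 + c10 - c11) / 2,
            (c00 + c01 - c10 - c11) / 2,
            (c00 - c01 - c10 + c11) / 2)


def build_decoder_table(encoder_codewords):
    """Run once, offline: inverse-transform every encoder codeword."""
    return [idct_2x2(c) for c in encoder_codewords]


def decode_blocks(indices, table):
    """Run per frame: decoding is a pure table look-up, no arithmetic."""
    return [table[i] for i in indices]


# Hypothetical encoder codebook in the DCT domain.
encoder_codebook = [(200, 0, 0, 0), (100, 20, 0, 0)]
table = build_decoder_table(encoder_codebook)
blocks = decode_blocks([1, 0], table)
```

The same offline step could also apply color conversion to each table entry, so that neither the inverse transform nor the conversion is paid for at display time.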

Thus, at the decoder there is no need for performing inverse block transforms. Color conversion, i.e., YUV to RGB,
is also performed as a pre-processing step by storing the corresponding color converted codebook. To display video
on a limited color palette display, the resulting codewords of the decoder codebook are quantized using a color
quantization algorithm. One such algorithm has been proposed by applicant Chaddha et al., "Fast Vector Quantization
Algorithms for Color Palette Design Based on Human Vision Perception," accepted for publication, IEEE Transactions
on Image Processing.

In this arrangement, color conversion involves forming an RGB or YUV color vector from the codebook codewords,
which are then color quantized to the required alphabet size. Thus, the same embedded index stream can be used
for displaying images on different alphabet decoders that have the appropriate codebooks with the correct alphabet 
size, e.g., 1-bit to 24-bit color. 

On the receiving end, the video decoder 40, 40' is responsible for reassembly of the lookup indices from the packets
received from the network. If one of the less significant index bit plane packets is somehow lost, the decoder uses the 
more significant bits to construct a shorter look-up table index. This yields a lower quality but still recognizable image. 
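
One way to picture this reassembly is sketched below. It is an assumption-laden illustration: lost packets are represented by `None`, and the sketch stops at the first missing plane on the assumption that less significant planes are not usable without the ones above them. The function name and sample values are hypothetical.

```python
def reassemble(msb_section, msb_bits, extra_planes):
    """Rebuild look-up indices from the received sections.

    msb_section  : per-block values of the most significant bits
    msb_bits     : how many bits each msb_section value carries
    extra_planes : 1-bit planes in decreasing significance, None if lost
    Returns (indices, bits_per_index).
    """
    indices = list(msb_section)
    bits = msb_bits
    for plane in extra_planes:
        if plane is None:      # packet lost: fall back to a shorter index
            break
        indices = [(v << 1) | b for v, b in zip(indices, plane)]
        bits += 1
    return indices, bits


# Two blocks; the second extra bit plane packet never arrived.
received = reassemble([0b1010, 0b0011], 4, [[1, 0], None, [1, 1]])
```

Here the decoder ends up with 5-bit indices and would index the 5-bit codebook, giving a coarser but still valid reproduction of each block.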

The use of separately identified packets containing index bit planes makes it possible for networks to easily scale
the video as a side effect of dropping less important packets. In networks providing QOS qualifiers such as ATM,
multiple circuits can be used to indicate the order in which packets should be dropped (i.e., the least significant bit
plane packets first). In an IP router environment, packet filters can be constructed to appropriately discard less important
packets. For prioritized networks, the base layer will be sent on the high priority channel while the enhancement layer
will be sent on the low priority channel. To provide error resiliency, using a fixed-rate coding scheme with some added
redundancy allows robustness in the event of packet loss.

It will be appreciated that a server according to the present invention can support two usage scenarios: point-to-point
demand or multicast (e.g., network cloud 50, networks 30, decoders 40, 40' in Figure 1).

In a point-to-point demand environment, each destination system decoder presents its specific requirements to
the server. The server then sends the selected elements of the embedded stream across the network to the destination.
A separate network stream per destination allows the user to have VCR style functionality such as play/stop/rewind,
fast forward/fast reverse. If congestion occurs on the network, the routers and switches can intelligently drop packets
from the embedded stream to give a lesser number of lookup bits.

In a multicast environment, the server, which has no information about the destination decoders, outputs the entire
[illegible in source] trees, depending on the granularity of traffic control desired. The primary traffic management
is performed during the construction of the unicast trees, by not adding branches of the trees carrying the less important
bit streams to the lower bandwidth networks. The network in this case takes care of bandwidth mismatches by not
forwarding packets to the networks which are not subscribed to a particular tree. Switches and routers can still react
to temporary congestion by intelligently dropping packets from the embedded stream to deliver fewer bits of lookup.

The system treats the audio track as a separate stream that is stored on disk 100 and transmitted [illegible in source].
The audio supports multiple data formats, from 8 KHz telephony quality (8 bit mu-law) to 48 KHz stereo quality audio
(2 channel, 16 bit linear samples). In practice, many video clips may have 8 [illegible in source] material distribution
over medium-to-low bandwidth networks. The server can store separate high and low quality audio tracks, and transmit
the audio track selected by the user. As the audio transits the [illegible in source], audio can easily be given a higher
QOS than the video. Rather than further [illegible in source] the audio packets, as is known in the prior art, in the
present invention audio is ramped down to silence when packets are overly delayed or lost.
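
Ramping audio down to silence can be illustrated with a short linear fade. The sketch below is an assumption-laden example rather than the system's audio code: it assumes 16-bit linear samples (one of the formats listed above), a linear gain curve, and a hypothetical function name `ramp_to_silence`.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical linear fade: scale n 16-bit samples in place so that the
 * first sample keeps its level and the last sample becomes silence.
 */
static void ramp_to_silence(int16_t *samples, size_t n)
{
    size_t i;
    size_t last = (n > 1) ? n - 1 : 1;  /* avoid dividing by zero when n == 1 */

    for (i = 0; i < n; ++i) {
        /* gain falls linearly from 1 at i = 0 to 0 at i = n - 1 */
        long scaled = (long)samples[i] * (long)(n - 1 - i) / (long)last;
        samples[i] = (int16_t)scaled;
    }
}
```

The same loop run with the gain reversed would bring the level back up once packets arrive on time again.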

As the audio and video are delivered via independent mechanisms to the decoding system, the two streams must
be synchronized [illegible in source] written memory region, into which the sequence information of the current audio
and video data [illegible in source].

The human perceptual system is more sensitive to audio dropouts than to video drops, and audio is more difficult
than video to temporarily reprocess. Thus, the decoder preferably uses the audio coder as the master clock for
synchronization [illegible in source] a "blackboard" or "scratchpad" portion of memory associated with CPU unit 140.
The slave threads (such as the video decoder) [illegible in source] be displayed. The slave threads then delay until
the appropriate time if the slave is early (e.g., more than 80 ms ahead of the audio). If the slave data is too late
(e.g., more than 20 ms behind the audio), then it is discarded, on the assumption that continuing to process late data
will delay more timely data.
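
The early/late rule for slave streams can be expressed as a small decision function. The 80 ms and 20 ms thresholds come from the text; everything else in this sketch (the names `sync_decide` and `enum sync_action`, and the use of millisecond presentation times) is a hypothetical illustration.

```c
/* Possible actions for a slave (e.g. video) stream relative to the audio master clock. */
enum sync_action { SYNC_DELAY, SYNC_DISPLAY, SYNC_DISCARD };

#define EARLY_LIMIT_MS 80   /* from the text: more than 80 ms ahead, so delay */
#define LATE_LIMIT_MS  20   /* from the text: more than 20 ms behind, so discard */

/*
 * Hypothetical decision function. Both arguments are presentation times
 * in milliseconds; audio_ms is the value posted by the audio thread.
 */
static enum sync_action sync_decide(long slave_ms, long audio_ms)
{
    long lead = slave_ms - audio_ms;    /* positive: slave is early */

    if (lead > EARLY_LIMIT_MS)
        return SYNC_DELAY;
    if (lead < -LATE_LIMIT_MS)
        return SYNC_DISCARD;
    return SYNC_DISPLAY;
}
```

A video thread would call this with the sequence time of its next frame and the audio time read from the shared memory region, then sleep, display, or drop the frame accordingly.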

The video decoder can optionally measure the deviation from the desired data delay rate and send speed-up and
slow-down indications back to the video server. This process synchronizes streams whose elements arrive in a timely
fashion and does not allow a slow stream to impede the progress of the other streams.

In the event of scarcity of resources, some global prioritization of user requests must take place to guard against
overload collapse. In a practical system, payment for services and resources may be used to define the overall value
[illegible in source] ordering of the [illegible in source] can be made, [illegible in source] the less important requests
can be dropped. The user specifies what he or she is willing [illegible in source] and the associated resources
[illegible in source] (e.g., admission control [illegible in source]) uses micro-economic models to decide what
[illegible in source] to the user. Such techniques are known in the art, e.g., M. Miller, "Extending
markets inward," Bionomics Conference, San Francisco, California (Oct. 1994).
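
Ranking requests by value and shedding the less important ones can be sketched as follows. This is only one simple policy and is not taken from the source: the record layout, the greedy admit-if-it-fits rule, and the names `struct request`, `by_payment_desc`, and `admit_requests` are all assumptions made for illustration.

```c
#include <stdlib.h>

/* Hypothetical request record; field names are illustrative. */
struct request {
    int    user_id;
    double offered_payment;  /* what the user is willing to pay */
    double bandwidth_kbps;   /* bandwidth the request needs */
    int    admitted;         /* set by admit_requests() */
};

/* qsort comparator: highest offered payment first. */
static int by_payment_desc(const void *a, const void *b)
{
    const struct request *ra = a;
    const struct request *rb = b;

    if (ra->offered_payment < rb->offered_payment) return 1;
    if (ra->offered_payment > rb->offered_payment) return -1;
    return 0;
}

/*
 * Sort requests by offered payment, then visit them in that order and
 * admit each one that still fits in the remaining capacity.
 * Returns the number of admitted requests.
 */
static int admit_requests(struct request *reqs, size_t n, double capacity_kbps)
{
    size_t i;
    int admitted = 0;

    qsort(reqs, n, sizeof reqs[0], by_payment_desc);
    for (i = 0; i < n; ++i) {
        reqs[i].admitted = (reqs[i].bandwidth_kbps <= capacity_kbps);
        if (reqs[i].admitted) {
            capacity_kbps -= reqs[i].bandwidth_kbps;
            ++admitted;
        }
    }
    return admitted;
}
```

A fuller model could instead offer a rejected user a lower-rate stream from the embedded bit stream, which is what scalability makes possible.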

[illegible in source] is to find the best possible combination of spatial resolution [illegible in source] visual based
[illegible in source], such as described by N. Chaddha and T. H. Y. Meng, "Psycho- [illegible in source]. [illegible in
source] also has the option of specifying the spatial resolution, frame rate and bandwidth directly.

[illegible in source] a software-based encoder with an encoding [illegible in source] software-based decoder, and
synchronization mechanism to provide a [illegible in source].

The processing components include audio capture, video capture, video compression, and a data striping tool.
The video is captured and digitized using single step VCR devices. Each frame is then compressed off-line (non-
real time) using the encoding algorithm. At present, it takes about one second on a SparcStation 20 Workstation to
compress a frame of video data, and single step VCR devices can step at a one frame per second rate, permitting
overlap of capture and compression.

The audio data preferably is captured as a single pass over the tape. The audio and video time stamps and sequence
numbers are aligned by the data striping tool as the video is stored, to facilitate later media synchronization.
The audio and video data preferably are striped onto the disks with a user-selected stripe size. In a preferred embodiment,
all of the video data on the server uses a 48 kilobyte stripe size, as 48 kilobytes per disk transfer provides good
utilization at peak load with approximately 50% of the disk bandwidth delivering data to the media server components.
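
To make the striping concrete, the sketch below maps a logical byte offset in a media file onto a disk and an offset within that disk. The 48 kilobyte stripe size is from the text; the round-robin layout, the number of disks, and the names `struct stripe_loc` and `locate` are assumptions for illustration only.

```c
#include <stdint.h>

#define STRIPE_SIZE (48u * 1024u)   /* 48 kilobyte stripe, from the text */

struct stripe_loc {
    unsigned disk;         /* which disk holds the byte */
    uint64_t disk_offset;  /* byte offset within that disk */
};

/* Hypothetical round-robin mapping of a logical offset onto ndisks disks. */
static struct stripe_loc locate(uint64_t logical_offset, unsigned ndisks)
{
    uint64_t stripe = logical_offset / STRIPE_SIZE;  /* stripe number */
    struct stripe_loc loc;

    loc.disk = (unsigned)(stripe % ndisks);
    loc.disk_offset = (stripe / ndisks) * STRIPE_SIZE
                    + logical_offset % STRIPE_SIZE;
    return loc;
}
```

With four disks, consecutive 48 kilobyte transfers land on disks 0, 1, 2, 3, 0, and so on, so sequential playback spreads its reads evenly across the array.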

The media server components include a session control agent, the audio transmission agent, and the video transmission
agent. The user connects to the session control agent on the server system and arranges to pay for the video
[illegible in source] network bandwidth. The user can specify the cost he/she is willing to [illegible in source] an
appropriately scaled [illegible in source] provided by the server. The session control agent (e.g., admission control
mechanism 110) then sets [illegible in source] control operations from the consumer's remote control [illegible in source].

[illegible in source] transmission agents read media data from the striped disks and pace the transmission
of the data [illegible in source]. The video transmission agent scales the embedded bit-stream in real-time by transmitting
[illegible in source] to reconstruct a selected resolution at the decoder. For example, a 320x240 stream with
8 bits of base, 4 bits of enhancement signal at 15 frames per second will transmit every other frame of video data with
[illegible in source] base, two packets containing the four most significant bits of the enhancement
[illegible in source]. The [illegible in source] sends the video and audio either for a point-to-point
situation or a multicast situation.

The media player components are the software based video decoder 40, 40', the audio receiver, and a user interface
[illegible in source] data from the network and decodes it using look-up tables and places the results
[illegible in source]. The decoder can run on any modern microprocessor unit without the CPU loading significantly.
[illegible in source] reading data from the network and queuing up the data for output to the speaker. In the
[illegible in source] the audio receiver will ramp the audio level down to silence level and then back up to the
[illegible in source]. The system performs media synchronization to
align the audio and video streams at the destination, using techniques such as described by J. D. Northcutt and E. M.
Kuerner, "System Support for Time-Critical Applications," Proc. NOSSDAV '91, Germany, pp. 242-254.

[illegible in source] is used in the on-demand case to control the flow. In the multicast case, the destinations are
slaved [illegible in source]. The user interface agent serves as the control connection to the
session agent on the media server, passing flow control feedback as well as the user's start/stop controls. The user
can specify the cost he or she is willing to pay and an appropriate stream will be provided by the system.

A prototype system embodying the present invention uses a video data rate that varies from 19.2 kbps to 2 Mbps
[illegible in source] the network available. The PSNR [illegible in source] for the decoding of a 160x120 resolution video on a
SparcStation 20. It can be seen from Table 1 that the time required to get the highest quality stream [illegible in source] at
160x120 resolution [illegible in source] (sum of lookup and packing time). This corresponds to a potential frame rate
[illegible in source].

TABLE 1. RESULTS FOR 160x120 RESOLUTION (DECODER)

No. of Bits     PSNR (dB)   Bandwidth as a function   CPU time per   Packing time per
of Lookup                   of frame rate (N), Kbps   frame (ms)     frame (ms)
(not legible)   31.63       19.2N                     1.24           0
(not legible)   32.50       24N                       1.32           0.52
(not legible)   34          28.8N                     1.26           0.80
(not legible)   35.8        33.6N                     1.10           1.09



Similarly, Table 2 gives the results for the decoding of a 320x240 resolution video on a SparcStation 20. It can be
seen from Table 2 that the time required to get the highest quality stream (8-bit base index and 8-bit first enhancement
layer index) at 320x240 resolution is 7.76 ms per frame (sum of look-up and packing time). This corresponds to a
potential frame rate of 130 frames/sec.
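
The frame-rate figure follows from simple arithmetic on the per-frame cost, which the short C sketch below reproduces. The helper name `potential_fps` is hypothetical; the inputs 6.09 ms and 1.67 ms are the CPU and packing times of the 8-bit row of Table 2.

```c
/* Hypothetical helper: frames per second if each frame costs the given milliseconds. */
static double potential_fps(double lookup_ms, double packing_ms)
{
    return 1000.0 / (lookup_ms + packing_ms);
}
```

Calling `potential_fps(6.09, 1.67)` divides 1000 ms by 7.76 ms and gives roughly 129 frames per second, which the text rounds to 130.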



TABLE 2. RESULTS FOR 320x240 RESOLUTION (8-BIT LOOKUP BASE)

No. of Bits   PSNR (dB)   Bandwidth as a function   CPU time per    Packing time per
of Lookup                 of frame rate (N), Kbps   frame (ms)      frame (ms)
2             33.72       48N                       (not legible)   0.385
4             35.0        52.8N                     6.04            0.645
5             35.65       62.4N                     6.05            0.92
6             36.26       67.2N                     6.08            1.20
7             36.9        72N                       6.04            1.48
8             37.5        76.8N                     6.09            1.67



Table 3 gives the results for the decoding of a 640x480 resolution video, again on a SparcStation 20. It can be seen
[illegible in source] ms (sum of lookup and packing time). This corresponds to a potential frame rate of 40
frames/sec.



TABLE 3. RESULTS FOR 640x480 WITH 320x240 INTERPOLATED

No. of Bits     PSNR (dB)   Bandwidth as a function   CPU time per   Packing time per
of Lookup                   of frame rate (N), Kbps   frame (ms)     frame (ms)
(not legible)   33.2        48N                       22.8           0.385
4               34          52.8N                     22.87          0.645
5               34.34       62.4N                     23.14          0.92
6               34.71       67.2N                     22.93          1.20
7               35.07       72N                       22.90          1.48
8               35.34       76.8N                     22.95          1.67



TABLE 4. RESULTS FOR 160x120 AT THE DISK SERVER

No. of Bits   Bandwidth as a function   CPU time per   Seek-time   Avg. CPU
of Lookup     of frame rate (N), Kbps   frame (ms)     (ms)        Load
4             19.2N                     2.84           16          1%
5             24N                       3.67           16          1%
6             28.8N                     4.48           14          2%
7             33.6N                     4.92           14          2%
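
One way to read the bandwidth column of Table 4 is as the number of index bits per frame multiplied by the frame rate N. The C sketch below is an observation that happens to reproduce the tabulated figures, not a formula stated in the source: it assumes one look-up index per 2x2 pixel block (the base-layer block size given in the claims), ignores packet headers and audio, and uses the hypothetical name `bits_per_frame`.

```c
/*
 * Hypothetical helper: bits per frame when every block_w x block_h pixel
 * block is represented by one look-up index of index_bits bits.
 * Packet headers and audio are ignored.
 */
static long bits_per_frame(int width, int height,
                           int block_w, int block_h, int index_bits)
{
    long blocks = (long)(width / block_w) * (long)(height / block_h);
    return blocks * index_bits;
}
```

For a 160x120 frame there are 80 x 60 = 4800 blocks, so a 4-bit index costs 19200 bits per frame, i.e. 19.2N Kbps at N frames per second, and a 7-bit index costs 33600 bits, i.e. 33.6N Kbps.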




Similarly, Table 5 shows the results for each individual disk for 320x240 resolution video. It can be seen that
obtaining the highest quality stream (8-bit base and 8-bit enhancement layer) at 320x240 requires 12.73 ms of CPU
time and an average CPU load of 7% on a SparcStation 20 workstation. The average disk access time per frame is
18 ms.



TABLE 5. RESULTS FOR 320x240 AT THE DISK SERVER

No. of Bits   Bandwidth as a function   CPU time per   Seek-time   Avg. CPU
of Lookup     of frame rate (N), Kbps   frame (ms)     (ms)        Load
2             48N                       10.47          18          6%
4             52.8N                     11.02          16          6%
5             62.4N                     11.55          18          6%
6             67.2N                     12.29          20          7%
7             72N                       12.55          20          7%
8             76.8N                     12.73          18          7%



/*
 * Scalable Video Displayer - Main Program
 *
 * Copyright 1995 Sun Microsystems, Inc.
 */

typedef signed int errorcode;

pixel *baselookupcodebook;
errorcode *errorcodebook;
errorcode *largecodebook;

/*
 * invert the single eight bit base index into the base 2x2 codebook
 * and store into the destination image
 */
static void
BasePixelInvert(pixel *dest, int stride, int index)
{
    index <<= 2;    /* index into the 4 pixel chunk */

    *(dest)          = baselookupcodebook[index];    /* the first pixel on line 0 */
    *(dest+1)        = baselookupcodebook[index+1];  /* the second pixel on line 0 */
    *(dest+stride)   = baselookupcodebook[index+2];  /* the first pixel on line 1 */
    *(dest+stride+1) = baselookupcodebook[index+3];  /* the second pixel on line 1 */
}

/*
 * step through the input indices and
 * invert each eight bit index into the codebook
 * and store into the destination image at the
 * correct location
 *
 * each input index gives 4 destination pixels
 */
void
BaseInvert(int nv, int nh, pixel *image, int stride, unsigned char *source)
{
    int x, y;
    int index;
    unsigned char *dest;

    for (y = nv; y > 0; --y) {
        dest = image;
        for (x = nh; x > 0; --x) {
            index = *source++;
            BasePixelInvert(dest, stride, index);
            dest += 2;          /* step to the next destination pixel */
        }
        image += 2 * stride;    /* step to the start of the next two lines */
    }
}



/*
 * combine one base pixel with four values from the error codebook
 * and store the results into a 2x2 block of the destination image
 */
static void
ErrorPixelInvert(pixel *dest, int stride, pixel base_pixel, int error_index)
{
    /* each destination base pixel gets combined with four error values */
    *(dest)          = base_pixel + errorlookupcodebook[error_index];
    *(dest+1)        = base_pixel + errorlookupcodebook[error_index+1];
    *(dest+stride)   = base_pixel + errorlookupcodebook[error_index+4];
    *(dest+stride+1) = base_pixel + errorlookupcodebook[error_index+5];
}

/*
 * step through the input indices and
 * invert each eight bit base index into the base 2x2 codebook
 * and invert each eight bit error index into the error 4x4 codebook
 * combine the results and then store into the destination image
 * at the correct location
 *
 * each pair of input indices gives 16 destination pixels
 */
void
ErrorInvert(int nv, int nh, pixel *image, int stride,
            unsigned char *source, unsigned char *errorband)
{
    unsigned char *line1, *line3;
    int base_index, error_index;
    pixel base_pixel;
    int x;    /* loop counter, not declared in the listing as extracted */

    while (nv > 0) {
        --nv;
        line1 = image;
        line3 = line1 + (2 * stride);
        for (x = nh; x > 0; --x) {
            /* index into 4 pixel chunk for base 2x2 */
            base_index = *source++;
            base_index <<= 2;

            /* index into the 16 pixel chunk error 4x4 */
            error_index = *errorband++;
            error_index <<= 4;

            base_pixel = baselookupcodebook[base_index+0];
            ErrorPixelInvert(line1, stride, base_pixel, error_index);
            line1 += 2;
            error_index += 2;

            base_pixel = baselookupcodebook[base_index+1];
            ErrorPixelInvert(line1, stride, base_pixel, error_index);
            line1 += 2;
            error_index += 6;

            base_pixel = baselookupcodebook[base_index+2];
            ErrorPixelInvert(line3, stride, base_pixel, error_index);
            line3 += 2;
            error_index += 2;

            base_pixel = baselookupcodebook[base_index+3];
            ErrorPixelInvert(line3, stride, base_pixel, error_index);
            line3 += 2;
        }
        image += 4 * stride;    /* step to next set of four lines */
    }
}



display_agent()
{
    int nbase;
    int nerror;
    unsigned char *basedata;
    unsigned char *errordata;

    while (global_status != EXIT) {
        nbase = AssembleBasePacketsFromNetwork(&basedata);
        SetBaseCodeBook(nbase);

        nerror = AssembleErrorPacketsFromNetwork(&errordata);
        SetErrorCodeBook(nbase);

        if (large_display)
            ErrorInvert(120, 160, Ximage->data, ximage_width, basedata, errordata);
        else
            BaseInvert(120, 160, Ximage->data, ximage_width, basedata);

        /* use standard X11 put image to display the result */
        XPutImage(display, xid, gc, Ximage, 0, 0, 0, 0, ximage_width, ximage_height);
    }
}

Claims



1. For use in a video delivery system server having a source of video images, an encoder providing an embedded
bit stream containing information including image data at at least two spatial resolutions, said embedded bit stream 
being sent over at least one network to at least one decoder, the encoder including: 

a central processor unit coupled to a memory unit; 

encoding means, digitally stored in said memory unit and coupled to said source of video images and receiving 
a first image at first spatial resolution, for decimating said first image to form a first intermediate image at half
said highest resolution, for decimating said first intermediate image to form a second intermediate image, for 
compressing said second intermediate image to form a base layer image whose resolution is less than said 
first image; 

said encoding means further decompressing said base layer image to form a third intermediate image, and 
interpolating said third intermediate image to form a fourth intermediate image, and for subtracting said fourth 
intermediate image from said first intermediate image to form a fifth intermediate image, and for compressing 
said fifth intermediate image to form a first enhancement layer image whose resolution is less than said first 
image but greater than said base layer image; 

said embedded bit stream containing at least said base layer image and said first enhancement layer image. 

2. The encoder of claim 1, wherein said embedded bit stream contains at least three spatial resolutions including an
additional image, whose resolution equals that of said first image, and said base layer image, and wherein said 
encoding means further interpolates said fourth intermediate image to form a sixth intermediate image that is 
subtracted from said first image to form a seventh intermediate image that is compressed to form a second en- 
hancement layer image whose resolution equals that of said first image; 

said embedded bit stream further containing said second enhancement layer image. 

3. The encoder of claim 1, wherein said embedded bit stream includes spatial resolution data encoded in pixel blocks,
and wherein said encoding means encodes said spatial resolution data using a discrete cosine transformation 
followed by a tree-structured vector quantization upon results of said transformation. 

4. The encoder of claim 2, wherein said first image has a resolution of 640x480 pixels, said additional image has a
resolution of 320x240 pixels, and said base layer image has a resolution of 160x120 pixels. 

5. The encoder of claim 4, wherein said embedded bit stream includes spatial resolution data encoded in pixel blocks 
of size 2x2 bits for said base layer image, of size 4x4 bits for said additional image, and of size 8x8 bits for said 
first image, and wherein said encoding means encodes said spatial resolution data using a discrete cosine trans- 
formation followed by a tree-structured vector quantization upon results of said transformation. 

6. The encoder of claim 3, wherein transform coefficients include input-weighted squared error defined as follows:

$$d(y, \hat{y}) = \sum_{j=1}^{K} w_j \, (y_j - \hat{y}_j)^2$$

where $y_j$ and $\hat{y}_j$ are components of a transformed vector $y$ and of a corresponding reproduction vector $\hat{y}$, and where
$w_j$ is a component of a weight vector generally dependent only upon $y$.

7. The encoder of claim 6, wherein said tree-structured vector quantization includes a perception model.

8. The encoder of claim 7, wherein said weight vector components reflect human visual sensitivity to quantization 
errors in different transform coefficients. 

9. The encoder of claim 7, wherein said tree-structured vector quantization has a tree depth R and has a vector
dimension k, and wherein bitstream bit rates 0/k, ..., R/k are provided.

10. The encoder of claim 7, wherein indices are transmitted in said embedded stream with different index planes; 



a first index plane containing a first index for a rate 1/k tree-structured vector quantization reference, and
a second index plane containing a second index for a rate 2/k tree-structured vector quantization reference.

11. The encoder of claim [illegible in source], wherein said embedded stream includes data packets, and wherein said indices are
associated with a relative priority of importance of at least some of said data packets.

12. [illegible in source] priority associated with said data packets permits selective [illegible in source] of relatively
[illegible in source].

13. The encoder of claim [illegible in source], wherein data for each video frame is [illegible in source] together, and wherein each frame has an
associated set of index planes with associated packet headers.

14. The encoder of claim 1, wherein said encoding means further includes at least one option selected from the group
[illegible in source] (a) estimation of motion [illegible in source], and (b) conditional [illegible in source].

15. A method, for use in a video delivery system server having a source of video images, of encoding an embedded
bit stream containing information including image data at at least two spatial resolutions, said embedded bit stream
being sent over at least one network to at least one decoder, the method including the following steps:

(a) providing a central processor unit coupled to a memory unit; and

(b) providing encoding means, digitally stored in said memory unit and coupled to said source of video images
and receiving a first image at first spatial resolution, for decimating said first image to form a first intermediate
image at half said highest resolution, for decimating said first intermediate image to form a second intermediate
image, for compressing said second intermediate image to form a base layer image whose resolution is less
than said first image;

said encoding means further decompressing said base layer image to form a third intermediate image, and
interpolating said third intermediate image to form a fourth intermediate image, and for subtracting said fourth
intermediate image from said first intermediate image to form a fifth intermediate image, and for compressing
said fifth intermediate image to form a first enhancement layer image whose resolution is less than said first
image but greater than said base layer image;

said embedded bit stream containing at least said base layer image and said first enhancement layer image.

16. The method of claim 15, wherein said embedded bit stream contains at least three spatial resolutions including
an additional image, whose resolution equals that of said first image, and said base layer image, and wherein at
step (b) said encoding means further interpolates said fourth intermediate image to form a sixth intermediate
image that is subtracted from said first image to form a seventh intermediate image that is compressed to form a
second enhancement layer image whose resolution equals that of said first image;

said embedded bit stream further containing said second enhancement layer image.

17. The method of claim [illegible in source], wherein said embedded bit stream includes spatial resolution data encoded in pixel blocks,
[illegible in source] transformation [illegible in source].

18. The method of claim 17, wherein at step (b), transform coefficients include input-weighted squared error defined
as follows:

$$d(y, \hat{y}) = \sum_{j=1}^{K} w_j \, (y_j - \hat{y}_j)^2$$

where $y_j$ and $\hat{y}_j$ are components of a transformed vector $y$ and of a corresponding reproduction vector $\hat{y}$, and where
$w_j$ is a component of a weight vector generally dependent only upon $y$.

19. The method of claim 18, wherein step (b) includes providing said weight vector components that reflect human 
visual sensitivity to quantization errors in different transform coefficients. 

20. The method of claim 18, wherein at step (b), said tree-structured vector quantization has a tree depth R and has
a vector dimension k, and wherein bitstream bit rates 0/k, ..., R/k are provided.

21. The method of claim 18, wherein said embedded bit stream includes data packets, and wherein at step (b), indices 
are transmitted in said embedded stream with different index planes; 

a first index plane containing a first index for a rate 1/k tree-structured vector quantization reference, and
a second index plane containing a second index for a rate 2/k tree-structured vector quantization reference;
said indices being associated with a relative priority of importance of at least some of said data packets.